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Abstract —Modern signal processing (SP) methods rely very 
heavily on probability and statistics to solve challenging SP 
problems. SP methods are now expected to deal with ever more 
complex models, requiring ever more sophisticated computa¬ 
tional inference techniques. This has driven the development 
of statistical SP methods based on stochastic simulation and 
optimization. Stochastic simulation and optimization algorithms 
are computationally intensive tools for performing statistical 
inference in models that are analytically intractable and beyond 
the scope of deterministic inference methods. They have been 
recently successfully applied to many difficult problems involving 
complex statistical models and sophisticated (often Bayesian) 
statistical inference techniques. This survey paper offers an 
introduction to stochastic simulation and optimization methods 
in signal and image processing. The paper addresses a variety of 
high-dimensional Markov chain Monte Carlo (MCMC) methods 
as well as deterministic surrogate methods, such as variational 
Bayes, the Bethe approach, belief and expectation propagation 
and approximate message passing algorithms. It also discusses a 
range of optimization methods that have been adopted to solve 
stochastic problems, as well as stochastic methods for deter¬ 
ministic optimization. Subsequently, areas of overlap between 
simulation and optimization, in particular optimization-within- 
MCMC and MCMC-driven optimization are discussed. 

Index Terms —Bayesian inference; Markov chain Monte Carlo; 
proximal optimization algorithms; variational algorithms for 
approximate inference. 

I. Introduction 

Modem signal processing (SP) methods, (we use SP here to 
cover all relevant signal and image processing problems), rely 
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very heavily on probabilistic and statistical tools; for example, 
they use stochastic models to represent the data observation 
process and the prior knowledge available, and they obtain 
solutions by performing statistical inference (e.g., using maxi¬ 
mum likelihood or Bayesian strategies). Statistical SP methods 
are, in particular, routinely applied to many and varied tasks 
and signal modalities, ranging from resolution enhancement of 
medical images to hyperspectral image unmixing; from user 
rating prediction to change detection in social networks; and 
from source separation in music analysis to speech recognition. 

However, the expectations and demands on the perfor¬ 
mance of such methods are constantly rising. SP methods 
are now expected to deal with challenging problems that 
require ever more complex models, and more importantly, 
ever more sophisticated computational inference techniques. 
This has driven the development of computationally intensive 
SP methods based on stochastic simulation and optimization. 
Stochastic simulation and optimization algorithms are compu¬ 
tationally intensive tools for performing statistical inference 
in models that are analytically intractable and beyond the 
scope of deterministic inference methods. They have been 
recently successfully applied to many difficult SP problems 
involving complex statistical models and sophisticated (of¬ 
ten Bayesian) statistical inference analyses. These problems 
can generally be formulated as inverse problems involving 
partially unknown observation processes and imprecise prior 
knowledge, for which they delivered accurate and insightful 
results. These stochastic algorithms are also closely related 
to the randomized, variational Bayes and message passing 
algorithms that are pushing forward the state of the art in 
approximate statistical inference for very large-scale prob¬ 
lems. The key thread that makes stochastic simulation and 
optimization methods appropriate for these applications is the 
complexity and high dimensionality involved. For example in 
the case of hyperspectral imaging the data being processed can 
involve images of 2048 by 1024 pixels across up to hundreds 
or thousands of wavelengths. 

This survey paper offers an introduction to stochastic simu¬ 
lation and optimization methods in signal and image process¬ 
ing. The paper addresses a variety of high-dimensional Markov 
chain Monte Carlo (MCMC) methods as well as deterministic 
surrogate methods, such as variational Bayes, the Bethe ap¬ 
proach, belief and expectation propagation and approximate 
message passing algorithms. It also discusses a range of 
stochastic optimization approaches. Subsequently, areas of 
overlap between simulation and optimization, in particular 
optimization-within-MCMC and MCMC-driven optimization 
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are discussed. Some methods such as sequential Monte Carlo 
methods or methods based on importance sampling are not 
considered in this survey mainly due to space limitations. 

This paper seeks to provide a survey of a variety of the 
algorithmic approaches in a tutorial fashion, as well as to high¬ 
light the state of the art, relationships between the methods, 
and potential future directions of research. In order to set the 
scene and inform our notation, consider an unknown random 
vector of interest x = [x\,..., xn] t and an observed data 
vector y = [y 1 ,..., 2/m] T , related to a; by a statistical model 
with likelihood function p(y\x; 6) potentially parametrized by 
a deterministic vector of parameters 6. Following a Bayesian 
inference approach, we model our prior knowledge about x 
with a prior distribution p{x\ 6), and our knowledge about x 
after observing y with the posterior distribution 


p(x\y;0) 


p(y\x- 0)p{x; G) 

Z(0) 


(1) 


where the normalising constant 


Z{0)=p(y;0) 


p(y\x;0)p(x;0)dx 


( 2 ) 


is known as the “evidence”, model likelihood, or the partition 
function. Although the integral in (0 suggests that all x 3 are 
continuous random variables, we allow any random variable 
Xj to be either continuous or discrete, and replace the integral 
with a sum as required. 

In many applications, we would like to evaluate the posterior 
p(x\y; G) or some summary of it, for instance point estimates 
of x such as the conditional mean (i.e., MMSE estimate) 
E{x\y;G}, uncertainty reports such as the conditional vari¬ 
ance var{x\y; G}, or expected log statistics as used in the 
expectation maximization (EM) algorithm M 

= argmaxEjlnpp, y\9)\y, G^} (3) 

0 


overlap between simulation and optimization, in particular 
the use of optimization techniques within MCMC algorithms 
and MCMC-driven optimization, and suggest some interesting 
areas worthy of exploration. Finally, in Section [Vl] we draw 
together thoughts, observations and conclusions. 

II. Stochastic Simulation methods 

Stochastic simulation methods are sophisticated random 
number generators that allow samples to be drawn from 
a user-specified target density n(x), such as the posterior 
p(x\y). These samples can then be used, for example, to 
approximate probabilities and expectations by Monte Carlo 
integration |4], Ch. 3]. In this section we will focus on Markov 
chain Monte Carlo (MCMC) methods, an important class of 
stochastic simulation techniques that operate by constructing 
a Markov chain with stationary distribution it. In particular, 
we concentrate on Metrogolis-Hastings algorithms for high¬ 
dimensional models (see [5|] for a more general recent review 
of MCMC methods). It is worth emphasizing, however, that we 
do not discuss many other important approaches to simulation 
that also arise often in signal processing applications, such as 
“particle filters” or sequential Monte Carlo methods |6|,|7|], and 
approximate Bayesian computation |@|. 

A cornerstone of MCMC methodology is the Metropolis- 
Hastings (MH) algorithm [44, Ch. 71 [ 9l 11 C)(l . a universal machine 
for constructing Markov chains with stationary density tt. 
Given some generic proposal distribution x* ~ q(■ x), the 
generic MH algorithm proceeds as follows. 


Algorithm 1 Metropolis-Hastings algorithm (generic version) 

Set an initial state x 
for t = 1 to T do 

Generate a candidate state x* from a proposal q(- |aP -1 )) 


where the expectation is taken with respect to p(x\y; G^). 
However, when the signal dimensionality N is large, the inte¬ 
gral in 0, as well as those used in the posterior summaries, 
are often computationally intractable. Hence, the interest in 
computationally efficient alternatives. An alternative that has 
received a lot of attention in the statistical SP community is 
maximum-a-posteriori (MAP) estimation. Unlike other poste¬ 
rior summaries, MAP estimates can be computed by finding 
the value of x maximizing p(x\y: G), which for many models 
is significantly more computationally tractable than numerical 
integration. In the sequel, we will suppress the dependence on 
G in the notation, since it is not of primary concern. 

The paper is organized as follows. After this brief introduc¬ 
tion where we have introduced the basic notation adopted. 
Section [II] discusses stochastic simulation methods, and in 
particular a variety of MCMC methods. In Section [[II] we 
discuss deterministic surrogate methods, such as variational 
Bayes, the Bethe approach, belief and expectation propaga¬ 
tion, and provide a brief summary of approximate message 
passing algorithms. Section m discusses a range of opti¬ 
mization methods for solving stochastic problems, as well 
as stochastic methods for solving deterministic optimization 
problems. Subsequently, in Section [V] we discuss areas of 


Compute the acceptance probability 

r t ) _ • Tip*) 

Generate ut ~ 7/(0,1) 

if u t < p ^ then 

Accept the candidate and set = x* 
else 

Reject the candidate and set = x ( t_1 ) 

end if 
end for 


Under mild conditions on q, the chains generated by Algo. 
Q] are ergodic and converge to the stationary distribution 
7T HU (m An important feature of this algorithm is that 
computing the acceptance probabilities pW does not require 
knowledge of the normalization constant of n (which is often 
not available in Bayesian inference problems). The intuition 
for the MH algorithm is that the algorithm proposes a stochas¬ 
tic perturbation to the state of the chain and applies a carefully 
defined decision rule to decide if this perturbation should 
be accepted or not. This decision rule, given by the random 
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accept-reject step in Algo. [Q ensures that at equilibrium the 
samples of the Markov chain have 7 r as marginal distribution. 

The specific choice of q will of course determine the 
efficiency and the convergence properties of the algorithm. 
Ideally one should choose q = tt to obtain a perfect sampler 
(i.e., with candidates accepted with probability 1); this is 
of course not practically feasible since the objective is to 
avoid the complexity of directly simulating from tt. In the 
remainder of this section we review strategies for specifying q 
for high-dimensional models, and discuss relative advantages 
and disadvantages. In order to compare and optimize the 
choice of q, a performance criterion needs to be chosen. A 
natural criterion is the stationary integrated autocorrelation 
time for some relevant scalar summary statistic g : R N —> R, 
i.e., 

OO 

T a = 1 + 2 ^ZCor{g{x {0) ),g(x {t) )} (4) 

t= 1 


with ®(°) ~ 7r, and where Cor(-, •) denotes the correlation 
operator. This criterion is directly related to the effective 
number of independent Monte Carlo samples produced by 
the MH algorithm, and therefore to the mean square error 
of the resulting Monte Carlo estimates. Unfortunately drawing 
conclusions directly from (0} is generally not possible because 
T g is highly dependent on the choice of g, with different 
choices often leading to contradictory results. Instead, MH 
algorithms are generally studied asymptotically in the infinite¬ 
dimensional model limit. More precisely, in the limit N —> 00 , 
the algorithms can be studied using diffusion process theory, 
where the dependence on g vanishes and all measures of 
efficiency become proportional to the diffusion speed. The 
“complexity” of the algorithms can then be defined as the 
rate at which efficiency deteriorates as N — > 00 , e.g., O(N) 
(see 13i l for an introduction to this topic and details about 
the relationship between the efficiency of MH algorithms and 
their average acceptance probabilities or acceptance rates 0. 

Finally, it is worth mentioning that despite the general¬ 
ity of this approach, there are some specific models for 
which conventional MH sampling is not possible because the 
computation of is intractable (e.g., when 7r involves an 
intractable function of x, such as the partition function of 
the Potts-Markov random field). This issue has received a 
lot of attention in the recent MCMC literature, and there are 
now several variations of the MH construction for intractable 

models iSIIMS- 


A. Random walk Metropolis-Hastings algorithms 

The so-called random walk Metropolis-Hastings (RWMH) 
algorithm is based on proposals of the form x* = '> + w. 
where typically w ~ A/"(0, X) for some positive-definite 
covariance matrix X 0 Ch. 7.5], This algorithm is one of 
the most widely used MCMC methods, perhaps because it has 
very robust convergence properties and a relatively low com¬ 
putational cost per iteration. It can be shown that the RWMH 

1 Notice that this measure of complexity of MCMC algorithms does not take 
into account the computational costs related to generating candidate states and 
evaluating their Metropolis-Hastings acceptance probabilities, which typically 
also scale at least linearly with the problem dimension N. 


algorithm is geometrically ergodic under mild conditions on 7r 
11711 - Geometric ergodicity is important because it guarantees 
a central limit theorem for the chains, and therefore that the 
samples can be used for Monte Carlo integration. However, 
the myopic nature of the random walk proposal means that 
the algorithm often requires a large number of iterations to 
explore the parameter space, and will tend to generate samples 
that are highly correlated, particularly if the dimension N is 
large (the performance of this algorithm generally deteriorates 
at rate O(N), which is worse than other more advanced 
stochastic simulation MH algorithms 0 ). This drawback can 
be partially mitigated by adjusting the proposal matrix X to 
approximate the covariance structure of n, and some adaptive 
versions of RWHM perform this adaptation automatically. For 
sufficiently smooth target densities, performance is further 
optimized by scaling X to achieve an acceptance probability 
of approximately 20% — 25% lITiil . 


B. Metropolis adjusted Langevin algorithms 

The Metropolis adjusted Langevin algorithm (MALA) is an 
advanced MH algorithm inspired by the Langevin diffusion 
process on R ;V , defined as the solution to the stochastic 
differential equation El 

dX(t) = ^X7 log n(X(t))dt + dW(t), X(0) = x 0 (5) 

where W is the Brownian motion process on R- v and Xi, € 
R jY denotes some initial condition. Under appropriate stability 
conditions, X(t) converges in distribution to 7r as t —» 00 , 
and is therefore potentially interesting for drawing samples 
from n. Since direct simulation from X(t) is only possible in 
very specific cases, we consider a discrete-time forward Euler 
approximation to © given by 

X< t+1 ) V + — V log 7r (iW) (6) 

where the parameter <5 controls the discrete-time increment. 
Under certain conditions on 7r and S, © produces a good 
approximation of X (t) and converges to a stationary density 
which is close to n. In MALA this approximation error is 
corrected by introducing an MH accept-reject step that guaran¬ 
tees convergence to the correct target density 7r. The resulting 
algorithm is equivalent to Algo. [0 above, with proposal 

q(x*\x ( ' t ~ 1 ' ) ) = 

1 ( H®* — — |Vlog7r (x^- 1 )) || 2 

exp { - 25 - 

(7) 

By analyzing the proposal © we notice that, in addition to 
the Langevin interpretation, MALA can also be interpreted as 
an MH algorithm that, at each iteration t, draws a candidate 
from a local quadratic approximation to log7r around x^~ l \ 
with (W 1 Ly as an approximation to the Hessian matrix. 

In addition, the MALA proposal can also be defined using 
a matrix-valued time step X. This modification is related to 
redefining © in an Euclidean space with the inner product 
(iu,X -1 ®) [20|]. Again, the matrix X should capture the 
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correlation structure of n to improve efficiency. For example, 
E can be the spectrally-positive version of the inverse Hessian 
matrix of log7r [21], or the inverse Fisher information matrix 
of the statistical observation model Eotl . Note that, in a similar 
fashion to preconditioning in optimization, using the exact full 
Hessian or Fisher information matrix is often too computation¬ 
ally expensive in high-dimensional settings and more efficient 
representations must be used instead. Alternatively, adaptive 
versions of MALA can learn a representation of the covariance 
structure online E 2 I 1 . For sufficiently smooth target densities, 
MALA’s performance can be further optimized by scaling E 
(or 6) to achieve an acceptance probability of approximately 
50% - 60% El. 

Finally, there has been significant empirical evidence that 
MALA can be very efficient for some models, particularly 
in high-dimensional settings and when the cost of computing 
the gradient V log tt(x) is low. Theoretically, for sufficiently 
smooth 7T, the complexity of MALA scales at rate 0(N 1 / 3 ) 
El . comparing very favorably to the 0(N ) rate of RWMH 
algorithms. However, the convergence properties of the con¬ 
ventional MALA are not as robust as those of the RWMH 
algorithm. In particular, MALA may fail to converge if the 
tails of 7r are super-Gaussian or heavy-tailed, or if S is chosen 
too large 0. Similarly, MALA might also perform poorly 
if 7 r is not sufficiently smooth, or multi-modal. Improving 
MALA’s convergence properties is an active research topic. 
Many limitations of the original MALA algorithm can now 
be avoided by using more advanced versions [2(| 24-271. 


C. Hamiltonian Monte Carlo 


a leap-frog approximation detailed in [28] 

w (t+5/2) _ w (t) _|_ ^Va;l0g7r 
x (t+5) = x (t) + S^ w (t+8/2) 

^(*+<5) _ w (t+S/2) _|_ log 7 T ^tE (t+<5 ^ 


( 9 ) 


where again the parameter S controls the discrete-time in¬ 
crement. The approximation error introduced by © is then 
corrected with an MH step targeting n(x,w). This algorithm 
is summarized in Algo.[2]below (see 128] for details about the 
derivation of the acceptance probability). 


Algorithm 2 Hamiltonian Monte Carlo (with leap-frog) 

Set an initial state x^°\ S > 0, and L € N. 
for t = 1 to T do 

Refresh the auxiliary variable w ~ Af(0, S). 

Generate a candidate (x*,w*) by propagating the current 
state with L leap-frog steps of length 5 

defined in ©. 

Compute the acceptance probability 


= min 



tt(x*,w*) \ 

w)) 


Generate u t ~ U( 0,1). 

if ut < P < ' t ' 1 then 

Accept the candidate and set = x*. 
else 

Reject the candidate and set x^ = x 

end if 
end for 


The Hamiltonian Monte Carlo (HMC) method is a very 
elegant and successful instance of an MH algorithm based 
on auxiliary variables 128]. Let w £ R w , E G M. NxN 
positive definite, and consider the augmented target density 
n(x,w) oc 7r(x) exp(— t;W T 'E~ 1 w), which admits the de¬ 
sired target density 7r(a:) as marginal. The HMC method is 
based on the observation that the trajectories defined by the 
so-called Hamiltonian dynamics preserve the level sets of 
tt(x,w). A point (ato,nto) £ R 2W that evolves according 
to the differential equations ® during some simulation time 
period (0, £] 


da; 

dt 

dm 

dt 


— V TO log7r(a;, w) = E 1 w 
Va, log n(x, w) = V x \ogir(x) 


( 8 ) 


yields a point (x tl w t ) that verifies Tr(x t ,w t ) = tt(xo,Wq). 
In MCMC terms, the deterministic proposal ® has n(x, w) 
as invariant distribution. Exploiting this property for stochastic 
simulation, the HMC algorithm combines © with a stochastic 
sampling step, w ~ A/”(0, E), that also has invariant distribu¬ 
tion 7r(a;, w), and that will produce an ergodic chain. Finally, 
as with the Langevin diffusion ©, it is generally not possible 
to solve the Hamiltonian equations ® exactly. Instead, we use 


Note that to obtain samples from the marginal tv(x) it 
is sufficient to project the augmented samples (x^,w^') 
onto the original space of x (i.e., by discarding the auxiliary 
variables w *■*)). It is also worth mentioning that under some 
regularity condition on 7r(a;), the leap-frog approximation 
© is time-reversible and volume-preserving, and that these 
properties are key to the validity of the HMC algorithm [28]. 

Finally, there has been a large body of empirical evidence 
supporting HMC, particularly for high-dimensional models. 
Unfortunately, its theoretical convergence properties are much 
less well understood |29|]. It has been recently established 
that for certain target densities the complexity of HMC scales 
at rate 0(N}/ 4 ), comparing favorably with MALA’s rate 
0(N 1 ^ 3 ) J30Q. However, it has also been observed that, as 
with MALA, HMC may fail to converge if the tails of n are 
super-Gaussian or heavy-tailed, or if S is chosen too large. 
HMC may also perform poorly if tr is multi-modal, or not 
sufficiently smooth. 

Of course, the practical performance of HMC also depends 
strongly on the algorithm parameters E, L and 6 112911 ■ The 
covariance matrix E should be designed to model the correla¬ 
tion structure of 7 t(x), which can be determined by performing 
pilot runs, or alternatively by using the strategies described 
in 0 ES 0. The parameters S and L should be adjusted 
to obtain an average acceptance probability of approximately 
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60% - 70% 0. Again, this can be achieved by performing 
pilot runs, or by using an adaptive HMC algorithm that adjusts 
these parameters automatically 132, 331. 


D. Gibbs sampling 

The Gibbs sampler (GS) is another very widely used MH 
algorithm which operates by updating the elements of x 
individually, or by groups, using the appropriate conditional 
distributions [ J, Ch. 10]. This divide-and-conquer strategy 
often leads to important efficiency gains, particularly if the 
conditional densities involved are “simple”, in the sense that 
it is computationally straightforward to draw samples from 
them. For illustration, suppose that we split the elements of 
x in three groups x = (xi,X2, X3 ), and that by doing so we 
obtain three conditional densities tt{xi\x 2 , * 3 ), 7r(a;2|tEi, * 3 ), 
and ■k(x 3 \xi,X 2 ) that are “simple” to sample. Using this 
decomposition, the GS targeting 7r proceeds as in Algo. [3] 
Somewhat surprisingly, the Markov kernel resulting from 
concatenating the component-wise kernels admits ir(x) as 
joint invariant distribution, and thus the chain produced by 
Algo. m has the desired target density (see 0 Ch. 10] for a 
review of the theory behind this algorithm). This fundamental 
property holds even if the simulation from the conditionals is 
done by using other MCMC algorithms (e.g., RWMH, MALA 
or HMC steps targeting the conditional densities), though this 
may result in a deterioration of the algorithm convergence rate. 
Similarly, the property also holds if the frequency and order of 
the updates is scheduled randomly and adaptively to improve 
the overall performance. 


E. Partially collapsed Gibbs sampling 

The partially collapsed Gibbs sampler (PCGS) is a recent 
development in MCMC theory that seeks to address some of 
the limitations of the conventional GS [36] - As mentioned 
previously, the GS performs poorly if strongly correlated 
elements of x are updated separately, as this leads to chains 
that are highly correlated and to an inefficient exploration 
of the parameter space. However, updating several elements 
together can also be computationally expensive, particularly if 
it requires simulating from difficult conditional distributions. 
In collapsed samplers, this drawback is addressed by carefully 
replacing one or several conditional densities by partially 
collapsed, or marginalized conditional distributions. 

For illustration, suppose that in our previous example the 
subvectors x, \ and exhibit strong dependencies, and that as 
a result the GS of Algo. [3 performs poorly. Assume that we 
are able to draw samples from the marginalized conditional 
density 7r(xi|a:3) = f 7t(xi, x 2 \x 3 )&X 2 , which does not 
depend on X 2 . This leads to the PCGS described in Algo. 
H] to sample from 7 r, which “partially collapses” Algo. 0 by 
replacing tt(xi\x 2 ,x$) with tt(xi\x 3 ). 


Algorithm 4 Partially collapsed Gibbs sampler 

Set an initial state x = (x[ n \ x 2 °\ x^) 
for t, = 1 to T do 

Generate x^ ~ 7 r fan 1^4* 

Generate x^ ~ 7r (x 2 \x^\x^ 

Generate x^ ~ 7r (x^lx^, xif^ 

end for 


Algorithm 3 Gibbs sampler algorithm 

Set an initial state x^ = (a4°\ x 2 °\ a4°4 
for t = 1 to T do 

Generate ~ 7 r (xi\x 2 1 \ a: 3 * ^ 

Generate x 2 ^ ~ 7 r (x 2 \x { f > , x$ 

Generate x^ ~ 7 r yx 3 \x^\ x 2 ^ 

end for 


As with other MH algorithms, the performance of the 
GS depends on the correlation structure of 7 r. Efficient sam¬ 
plers seek to update simultaneously the elements of x that 
are highly correlated with each other, and to update “slow- 
moving” elements more frequently. The structure of 7 r can be 
determined by pilot runs, or alternatively by using an adaptive 
GS that learns it online and that adapts the updating schedule 
accordingly as described in H. However, updating elements 
in parallel often involves simulating from more complicated 
conditional distributions, and thus introduces a computational 
overhead. Finally, it is worth noting that the GS is very useful 
for SP models, which typically have sparse conditional inde¬ 
pendence structures (e.g., Markovian properties) and conjugate 
priors and hyper-priors from the exponential family. This often 
leads to simple one-dimensional conditional distributions that 
can be updated in parallel by groups Uj, 35]. 


Van Dyk and Park |[36] established that the PCGS is always 
at least as efficient as the conventional GS, and it has been 
observed that that the PCGS is remarkably efficient for some 
statistical models 137. 38]. Unfortunately, PCGSs are not as 
widely applicable as GSs because they require simulating 
exactly from the partially collapsed conditional distributions. 
In general, using MCMC simulation (e.g., MH steps) within 
a PCGS will lead to an incorrect MCMC algorithm lf39tl . 
Similarly, altering the order of the updates (e.g., by permuting 
the simulations of X\ and x 2 in Algo. Q]i may also alter the 
target density 


III. Surrogates for stochastic simulation 
A. Variational Bayes 

In the variational Bayes (VB) approach described in 0 
Eh, the true posterior p(x\y) is approximated by a density 
q*(x) £ Q, where Q is a subset of valid densities on x. In 
particular, 

<7*0) = avgtmn D(q(x)\\p(x\y)) (10) 

where D(q\\p) denotes the Kullback-Leibler (KL) divergence 
between p and q. As a result of the optimization in ([TOl ) 
over a function, this is termed “variational Bayes” because 
of the relation to the calculus of variations B42I1 . Recalling 
that D(q\\p) reaches its minimum value of zero if and only if 
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p = q 143il . we see that g*(x) = p(x\y) when Q includes all 
valid densities on x. However, the premise is that p(x\y) is 
too difficult to work with, and so Q is chosen as a balance 
between fidelity and tractability. 

Note that the use of D(q\\p), rather than D(p\\q), implies a 
search for a q t that agrees with the true posterior p(x\y) over 
the set of x where p{x\y) is large. We will revisit this choice 
when discussing expectation propagation in Section IIII-EI 

Rather than working with the KL divergence directly, it is 
common to decompose it as follows 


D(q{x)\\p(x\y)) = J q(x) In da , = 1 nZ + F(q) 

( 11 ) 

where 

F(q) = [ q{x) In ^ - da; (12) 

J p(x,y) 

is known as the Gibbs free energy or variational free energy. 
Rearranging iflTl ). we see that 


— InZ = F(q) - D(q(x)\\p(x\y)) < F(q) (13) 

as a consequence of D(q(x) ||p(a;|y)) > 0. Thus, F(q) can 
be interpreted as an upper bound on the negative log partition. 
Also, because In Z is invariant to q. the optimization ( ITOb can 
be rewritten as 


q*{x) = argminF(g), (14) 

q£Q 

which avoids the difficult integral in ©. In the sequel, we will 
discuss several strategies to solve the variational optimization 
problem ( 1 1 4k 


B. The mean-field approximation 

A common choice of Q is the set of fully factorizable 
densities, resulting in the mean-field approximation |44|, [45]] 


N 


q( x ) = II 

3 = 1 


3 J- 


(15) 


Substituting (fl~5l > into ( IT 2 T> yields the mean-field free energy 


Fmf{q) = f [YlvM 


In 


1 


N 


P(x,y) 


dx-^2 h (qj) (16) 


.7=1 


where h{q 3 ) — — f Qj(xj ) hi qj(xj ) dx 3 denotes the differen¬ 
tial entropy. Furthermore, for j = 1,..., N, equation ( 1 1 61 can 
be written as 


^mf(<?) = D(q j (x j )\\g j (x j ,y)) -'Y^h(q i ) (17) 


M j 


9j 


i x j, y) — ex P / lnp(x,y)dx\j (18) 

J i±3 


where x\j = [xi, ..., Xj-i,Xj, ..., Xn] t for j = 1,N. 

Equation ( fTTb implies the optimality condition 


QjAxj) 


9jA x j,y) 

19jA x j^y) ^ 


Vj = l,...,N 


(19) 


where q±{x) = Jl/Li ) anc l where gj iir (xj,y) is defined 

as in m but with qi,*(xf) in place of qi(xf). Equation ( IT9l > 


suggests an iterative coordinate-ascent algorithm: update each 
component q 3 {xj) of q(x) while holding the others fixed. But 
this requires solving the integral in ( fl 8 ] >. A solution arises 
when Vj the conditionals p(xj,y\x\j) belong to the same 
exponential family of distributions 114611 , i.e., 

P(.Xj,y\x\j) oc h(xj) exp (rjixsj, y) T t(xj)) (20) 

where the sufficient statistic t(xj) parameterizes the family. 
The exponential family encompasses a broad range of dis¬ 
tributions, notably jointly Gaussian and multinomial p(x,y). 
Plugging p(x, y) = p{xj,y \ x\ :j )p{x\ 3 ,y) and © into © 
immediately gives 

9j(xj,y) oc h(xj) exp ( E{rj(x\j, y)} T t(xj)) (21) 

where the expectation is taken over x\j ~ Yiijtj Qi,*{ x i)- 
Thus, if each qj is chosen from the same family, i.e., qj{x 3 ) oc 
h(xj) exp (~fjt(xj)) Vj, then (IT9l > reduces to the moment¬ 
matching condition 

= E { r l( x \j,y)} I 22 ) 

where r y 1 + is the optimal value of 7 •. 


C. The Bethe approach 

In many cases, the fully factored model ( fl~5l > yields too gross 
of an approximation. As an alternative, one might try to fit a 
model q(x) that has a similar dependency structure as p(x\y). 
In the sequel, we assume that the true posterior factors as 

p(x\y) = Z-'Ylfaixa) (23) 

a 

where x a are subvectors of x (sometimes called cliques or 
outer clusters ) and f a are non-negative potential functions. 
Note that the factorization (l23l > defines a Gibbs random field 
whenp(a;|y) > 0. When a collection of variables {x n } always 
appears together in a factor, we can collect them into xp, 
an inner cluster, although it is not necessary to do so. For 
simplicity we will assume that these xp are non-overlapping 
(i.e., xp n xpr = 0 V/3 ^ /)'), so that {xp\ represents a 
partition of x. The factorization ( I2.H can then be drawn as a 
factor graph to help visualize the structure of the posterior, as 
in Figure Q] 



Fig. 1. An example of a factor graph, which is a bipartite graph consisting 
of variable nodes, (circles/ovals), and factor nodes, (boxes). In this example, 
x a = {ari, a?2}» x b — x c = {% 2 , x a}- There are several 

choices for the inner clusters xp. One is the full factorization x\ = x\, 
x 2 = x 2, x 3 = x 3 , and X4 = X4. Another is the partial factorization 
x 1 = xi, X2 = X2, and xs = {# 3 , X4}, which results in the “super node” in 
the dashed oval. Another is no factorization: x\ = {a?i, X2, xs, X4}, resulting 
in the “super node” in the dotted oval. In the latter case, we redefine each 
factor f a to have the full domain x (with trivial dependencies where needed). 
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We now seek a tractable way to build an approximation q(x) 
with the same dependency structure as (1231 . But rather than 
designing q(x ) as a whole, we design the cluster marginals, 
{q a (x a )} and {qp(xp)}, which must be non-negative, nor¬ 
malized, and consistent 

0 <q a (x a ), 0 < qp(xp) Va, /3,x a ,xp 


<l(x) = 


rig 9a ( X a) 


= F B ({q a j,{qp}) 


Belief propagation (BP) B491 50] is an algorithm for comput¬ 


a factor graph. The standard form of BP is given by the sum- 
product algorithm (SPA) [51], which computes the following 
messages from each factor node f a to each variable (super) 
node Xfl and vice versa 

ttla— >/3 {.XI foi(x a ) | Tnp/—t a i,Xp f ) 3x a \p (29) 


qp(xp) = f q a (x a )dx a \p Wa,/3 e<yi a ,xp (26) 

where x a \p gathers the components of x that are contained 
in the cluster a and not in the cluster /3, and 9T> denotes the 
neighborhood of the factor a (i.e., the set of inner clusters 3 
connected to a). 

In general, it is difficult to specify q(x) from its cluster 
marginals. However, in the special case that the factor graph 
has a tree structure (i.e., there is at most one path from one 
node in the graph to another), we have 0 


(24) 

m a — 

(25) 

mp- 

(26) 

These 


«0/3) n * 

ot' ^flp\cx. 


p'e*i a \p 
>p(xp). 


qp(xp) oc m a ^./ 3 (xp) 

q a {x a ) OC fa(Xa ) j jj. 


(30) 


(31) 


(32) 


(27) 


where Np = | 9 I< 3 | is the neighborhood size of the cluster /3. 
In this case, the free energy (fl2l simplifies to 

F(q) = ^2D(q a \\f a ) + ^2(Np - 1 )h{qp) +const, (28) 


where Fq is known as the Bethe free energy (BFE) ||47tl . 

Clearly, if the true posterior {/ Q } has a tree structure, 
and no constraints beyond (I24l - (l26l are placed on the cluster 
marginals {<?a}, {ffs}. then minimization of Lb({< 7 q }, {< 373 }) 
will recover the cluster marginals of the true posterior. But 
even when {/„} is not a tree, {qp}) can be used 

as an approximation of the Gibbs free energy F(q), and 
minimizing Fq can be interpreted as designing a q that locally 
matches the true posterior. 

The remaining question is how to efficiently minimize 
-fB({< 7 a}> {< 7 / 3 }) subject to the (linear) constraints (I24l i-(l26l>. 
Complicating matters is the fact that {<373}) is the 

sum of convex KL divergences and concave entropies. One op¬ 
tion is direct minimization using a “double loop” approach like 
the concave-convex procedure (CCCP) ||48j], where the outer 
loop linearizes the concave term about the current estimate 
and the inner loop solves the resulting convex optimization 
problem (typically with an iterative technique). Another option 
is belief propagation, which is described below. 

D. Belief propagation 


mp^ a (xp) 

pe<n a 

which must be normalized in accordance with ( 1251 ). The 
messages ([29l)- ([30b do not need to be normalized, although 
it is often done in practice to prevent numerical overflow. 

When the factor graph { f a } has a tree structure, the BP- 
computed marginals coincide with the true marginals after one 
round of forward and backward message passes. Thus, BP on 
a tree-graph is sometimes referred to as th e forward-backward 
algorithm, particularly in the context of hidden Markov models 
1520 • In the tree case, BP is akin to a dynamic programming 
algorithm that organizes the computations needed for marginal 
evaluation in a tractable manner. 

When the factor graph {f a } has cycles or “loops,” BP can 
still be applied by iterating the message computations ( l29l >- 
( |30| ) until convergence (not guaranteed), which is known as 
loopy BP (LBP). However, the corresponding beliefs (I3ll)-(l32l) 
are in general only approximations of the true marginals. This 
suboptimality is expected because exact marginal computation 
on a loopy graph is an NP-hard problem [53]. Still, the answers 
computed by LBP are in many cases very accurate 0 . 
For example, LBP methods have been successfully applied 
to communication and SP problems such as: turbo decoding 
155], LDPC decoding 1491 5611 . inference on Markov random 
fields |57], multiuser detection [58], and compressive sensing 


I511& 


ing (or approximating) marginal probability density functions 
(pdfsj! like qp(xp) and q a (x a ) by propagating messages on 

2 Note that another form of BP exists to compute the maximum a posteriori 
(MAP) estimate argmax® p(x\y) known as the “max-product” or “min- 
sum” algorithm (Hi. However, this approach does not address the problem of 
computing surrogates for stochastic methods, and so is not discussed further. 


Although the excellent performance of LBP was at first 
a mystery, it was later established that LBP minimizes the 
constrained BFE. More precisely, the fixed points of LBP 
coincide with the stationary points of Lb({<Z a}> { 9 /?}) from 
(l28l) under the constraints < 124b -(126b [47]. The link between 
LBP and BFE can be established through the Lagrangian 
formalism, which converts constrained BFE minimization to 
an unconstrained minimization through the incorporation of 
additional variables known as Lagrange multipliers [61]]. By 
setting the derivatives of the Lagrangian to zero, one obtains 
a set of equations that are equivalent to the message updates 
d29ll -(l3()t |{47i]. In particular, the stationary-point versions of 
the Lagrange multipliers equal the fixed-point versions of the 
loopy SPA log-messages. 

Note that, unlike the mean-held approach (IT5l) . the cluster- 
based nature of LBP does not facilitate an explicit description 
of the joint-posterior approximation q £ Q from (flOl) . The 
reason is that, when the factor graph is loopy, there is no 
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straightforward relationship between the joint posterior q(x) 
and the cluster marginals {q a (x a )}, {qp(xp)}, as explained 
before ( f27l) . Instead, it is better to interpret LBP as an efficient 
implementation of the Bethe approach from Section IIII-CI 
which aims for a local approximation of the true posterior. 

In summary, by constructing a factor graph with low¬ 
dimensional {xp} and applying BP or LBP, we trade the high¬ 
dimensional integral qp{xp) = f p(x\y) dx\p for a sequence 
of low-dimensional message computations ©-©. But d29}- 
(l30l > are themselves tractable only for a few families of {/ Q }. 
Typically, {/ Q } are limited to members of the exponential 
family closed under marginalization (see [6211 1. so that the 
updates of the message pdfs (|2L>1 >- (OTTt reduce to updates of 
the natural parameters (i.e., rj in <l20b ). The two most common 
instances are multivariate Gaussian pdfs and multinomial 
probability mass functions (pmfs). For both of these cases, 
when LBP converges, it tends to be much faster than double¬ 
loop algorithms like CCCP (see, e.g., l63ll l. However, LBP 
does not always converge [54]. 


E. Expectation propagation 


Expectation propagation (EP) [64] (see also the overviews 


162, 65]) is an iterative method of approximate inference that is 
reminiscent of LBP but has much more flexibility with regards 
to the modeling distributions. In EP, the true posterior p{x\y), 
which is assumed to factorize as in (l23l >. is approximated by 
q(x) such that 


q(x) oc Y\_m a (x a ) (33) 

a 

where x a are the same as in ( 123b and m a are referred to as 
“site approximations.” Although no constraints are imposed 
on the true-posterior factors {/ Q }, the approximation q(x) is 
restricted to a factorized exponential family. In particular. 


q ( x ) = II9/3O/3) 


(34) 


qp{xp) = exp (■y^tp(xp) - cpi'Yp)), V/3, (35) 

with some given base measure. We note that our description 
of EP applies to arbitrary partitions {xp}, from the trivial 
partition xp = x to the full partition xp = xp. 

The EP algorithm iterates the following updates over all 
factors a until convergence (not guaranteed) 


q^ a (x) £- q(x)/m a (x a ) 

(36) 

q^ix) £- q\ a {x)f a (x a ) 

(37) 

g new ( x) «— argminD(g\“(a;) q(a:)) 

(38) 

mT{x a ) «- q new (x)/q\ a (x) 

(39) 

q(x) «— q nevj (x) 

(40) 

m a (x a ) ■<— m?™(x a ) 

(41) 


where Q in < 1 38 b refers to the set of q(x) obeying (l34l >- (l35l >. 
Essentially, (l36l > removes the oth site approximation m a from 
the posterior model q , and ( 137 b replaces it with the true factor 
f a . Here, q'' n is know as the “cavity” distribution. The quantity 
is then projected onto the exponential family in (f38l> . 


The site approximation is then updated in (l39l >. and the old 
quantities are overwritten in d40l >- (l4TI >. Note that the right side 
of d39l > depends only on x a because q new (x) and q\ a (x) differ 
only over x a . Note also that the KL divergence in (l38l > is 
reversed relative to ®. 

The EP updates (f37l) - (m~b can be simplified by leveraging 
the factorized exponential family structure in (l34l >- (l35l >. First, 
for (l33l i to be consistent with (l34l >- (l35l >. each site approxima¬ 
tion must factor into exponential-family terms, i.e., 

iTia(x a ) — | m at p(x a ) (42) 

/3e9t Q 

m at p(x a ) = exp {pL^ ptp{xp)). (43) 

It can then be shown {62^\ that (I36l >- (l38l ) reduce to 

7 ^ ew G- argmax [^ T E^{tp{xp)} - cp{- 7 )] (44) 

for all ii € 7l Q , which can be interpreted as the moment match¬ 
ing condition E ^new {^(a^)} = E~\ a {tp(xp)}. Furthermore, 
(l39l > reduces to 


..new , -,new 

Hc,/3 7/3 


Ha',/3 


(45) 


for all /3 G 7t Q . Finally, (l40t and (|4H reduce to 7 p £- 7 " ew 
and y a p /x" e ™, respectively, for all /3 G 7T Q . 

Interestingly, in the case that the true factors {f a } are mem¬ 
bers of an exponential family closed under marginalization, the 
version of EP described above is equivalent to the SPA up to 
a change in message schedule. In particular, for each given 
factor node a, the SPA updates the outgoing message towards 
one variable node /3 per iteration, whereas EP simultaneously 
updates the outgoing messages in all directions, resulting in 
?n” ew (see, e.g., [41]). By restricting the optimization in (]39b 
to a single factor qp esN , EP can be made equivalent to the SPA. 
On the other hand, for generic factors {/ Q }, EP can be viewed 
as a tractable approximation of the (intractable) SPA. 

Although the above form of EP iterates serially through 
the factor nodes a, it is also possible to perform the updates 
in parallel, resulting in what is known as the expectation 
consistent (EC) approximation algorithm [66]. 

EP and EC have an interesting BFE interpretation. Whereas 
the fixed points of LBP coincide with the stationary points of 
the BFE (l28l > subject to (I24l >- (l25l > and strong consistency (l26i l. 
the fixed points of EP and EC coincide with the stationary 
points of the BFE (l28l i subject to (ED-CZ! and the weak 
consistency (i.e., moment-matching) constraint [67] 




,{t(xp)} = Eqp{t(xp)}, Va,/3e7I Q . (46) 

EP, like LBP, is not guaranteed to converge. Hence, provably 
convergence double-loop algorithms have been proposed that 
directly minimize the weakly constrained BFE, e.g., [67]. 


F. Approximate message passing 

So-called approximate message passing (AMP) algorithms 


159, 6QD have recently been developed for the separable 


generalized linear model 


N 


M 


p( x ) = Y[px(xj), p(y\ x ) = n Py\z(ym\a?nX) (47) 


j =1 
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where the prior p(x) is fully factorizable, as is the conditional 
pdf p(y\z) relating the observation vector y to the (hidden) 
transform output vector z = Ax, where A = [ai, am] t £ 
IK A 1 x N is a known linear transform. Like EP, AMP allows 
tractable inference under generi^ p x and p y \ z . 

AMP can be derived as an approximation of LBP on the 
factor graph constructed with inner clusters xp = xp for (3 = 
1 ,,N, with outer clusters x a = x for a = 1,..., M and 
x a = x a -M for a = M + 1,..., M + N, and with factors 


/ a (x Q ) 


Py\z(ya\ a Q x ) 

Px( x ct—M ) 


a = 1,..., M 
a = M + l,...,M + N. 


(48) 


In the large-system limit (LSL), i.e., M,N —> oo for fixed 
ratio M/N, the LBP beliefs qp(xp) simplify to 


qp(xp) otp x (xp)Af(xp;rp,T) (49) 

where {rp}^ =1 and r are iteratively updated parameters. 
Similarly, for m = 1,... ,M, the belief on z m , denoted by 
<Zz,m(')’ simplifies to 


qz,m (~m ) OC Py\z{ym\z m )Af Pmi 


(50) 


where {p m }m=i an d v are iteratively updated parameters. 
Each AMP iteration requires only one evaluation of the mean 
and variance of (l49l >- (l50l >. one multiplication by A and A 1 , 
and relatively few iterations, making it very computationally 
efficient, especially when these multiplications have fast im¬ 
plementations (e.g., using fast Fourier transforms and discrete 
wavelet transforms ). 

In the LSL under i.i.d sub-Gaussian A, AMP is fully 
characterized by a scalar state evolution (SE). When this SE 
has a unique fixed point, the marginal posterior approximations 
(l49l >- <r50b are known to be exact Ifijl IhOll . 

For generic A, AMP’s fixed points coincide with the 
stationary points of an LSL version of the BFE 0(lI|].When 
AMP converges, its posterior approximations are often very 
accurate (e.g., [72j]), but AMP does not always converge. In 
the special case of Gaussian likelihood p y | z and prior p x , AMP 
convergence is fully understood: convergence depends on the 
ratio of peak-to-average squared singular values of A, and 
convergence can be guaranteed for any A with appropriate 
damping [73]. For generic p x and p y | z , damping greatly helps 
convergence 0 but theoretical guarantees are lacking. A 
double-loop algorithm to directly minimize the LSL-BFE was 
recently proposed and shown to have global convergence for 
strictly log-concave p x and p y | z under generic A [75]. 


IV. Optimization methods 
A. Optimization problem 

The Monte Carlo methods described in Section [II] provide 
a general approach for estimating reliably posterior proba¬ 
bilities and expectations. However, their high computational 
cost often makes them unattractive for applications involving 
very high dimensionality or tight computing time constraints. 
One alternative strategy is to perform inference approximately 

3 More precisely, the AMP algorithm jf59tl handles Gaussian p y | z while the 
generalized AMP (GAMP) algorithm 116011 handles arbitrary Py\ z - 


by using deterministic surrogates as described in Section 
Hill Unfortunately, these faster inference methods are not as 
generally applicable, and because they rely on approximations, 
the resulting inferences can suffer from estimation bias. As 
already mentioned, if one focuses on the MAP estimator, 
efficient optimization techniques can be employed, which are 
often more computationally tractable than MCMC methods 
and, for which strong guarantees of convergence exist. In many 
SP applications, the computation of the MAP estimator of 
x can be formulated as an optimization problem having the 
following form 

minimize p(H x, y) + g(Dx) (51) 

where ip: K M x K M —> ]— oo,+oo], g: R p —> ]— oo, +oo], 
H £ R MxN , and D £ R PxAr with P £ N*. For example, 
H may be a linear operator modeling a degradation of the 
signal of interest, y a vector of observed data, ip a least- 
squares criterion corresponding to the negative log-likelihood 
associated with an additive zero-mean white Gaussian noise, 
g a sparsity promoting measure, e.g., an norm, and D a 
frame analysis transform or a gradient operator. 

Often, ip is an additively separable function, i.e., 

1 M 

y) = — Pi(zi, yi) V(z, y) £ (R M ) 2 (52) 

1 i=1 

where z = [z\, zm} T ■ Under this condition, the previous 
optimization problem becomes an instance of the more general 
stochastic one 


minimize T(.x) 

seM* 


g{Dx) 


involving the expectation 


$(x) = E {<Pj{hJx, yj )} 


(53) 


(54) 


where j, y, and H are now assumed to be random variables 
and the expectation is computed with respect to the joint 
distribution of ( j,y,H ), with h 1 - the j-th line of H. More 
precisely, when ( ]52] i holds, ( TSTb is then a special case of ( l53l l 
with j uniformly distributed over {1,..., M} and (y, H) de¬ 
terministic. Conversely, it is also possible to consider that j is 
deterministic and that for every i £ {2, ..., M}, <pi = pi, and 
(yi, h, ) [ <i<M are identically distributed random variables. In 
this second scenario, because of the separability condition 
(l52l >. the optimization problem ( l5ll can be regarded as a proxy 
for (|53| >. where the expectation <I>(.x) is approximated by a 
sample estimate (or stochastic approximation under suitable 
mixing assumptions). All these remarks illustrate the existing 
connections between problems (TsTb and ( I53| i. 

Note that the stochastic optimization problem defined in 
(|53| > has been extensively investigated in two communities: 
machine learning, and adaptive filtering, often under quite 
different practical assumptions on the forms of the functions 
{<Pj)j>i and g. In machine learning [76-78], x indeed rep¬ 
resents the vector of parameters of a classifier which has to 
be learnt, whereas in adaptive filtering 00 , it is generally 
the impulse response of an unknown filter which needs to 
be identified and possibly tracked. In order to simplify our 
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presentation, in the rest of this section, we will assume that 
the functions ( l fij)j>i are convex and Lipschitz-differentiable 
with respect to their first argument (for example, they may be 
logistic functions). 


B. Optimization algorithms for solving stochastic problems 

The main difficulty arising in the resolution of the stochastic 
optimization problem (l53l i is that the integral involved in the 
expectation term often cannot be computed in practice since it 
is generally high-dimensional and the underlying probability 
measure is usually unknown. Two main computational ap¬ 
proaches have been proposed in the literature to overcome 
this issue. The first idea is to approximate the expected loss 
function by using a finite set of observations and to minimize 
the associated empirical loss ([Til . The resulting deterministic 
optimization problem can then be solved by using either 
deterministic or stochastic algorithms, the latter being the topic 
of Section IIV-CI Here, we focus on the second family of 
methods grounded in stochastic approximation techniques to 
handle the expectation in (l54l >. More precisely, a sequence 
of identically distributed samples (yj, hj)j> \ is drawn, which 
are processed sequentially according to some update rule. The 
iterative algorithm aims to produce a sequence of random 
iterates {xj)j>i converging to a solution to (l53l i. 

We begin with a group of online learning algorithms based 
on extensions of the well-known stochastic gradient descent 
(SGD) approach. Then we will turn our attention to stochastic 
optimization techniques developed in the context of adaptive 
filtering. 

1) Online learning methods based on SGD: Let us assume 
that an estimate Uj € of the gradient of $ at xj is 
available at each iteration j > 1. A popular strategy for solving 
(l53l > in this context leverages the gradient estimates to derive 
a so-called stochastic forward-backward (SFB) scheme, (also 
sometimes called stochastic proximal gradient algorithm) 


(Wj >1) zj = prox 7 . ffor) (xj - 7 jUj) 

Xj +1 = (1 - A j)xj + XjZj (55) 

where ( r Yj)j>i is a sequence of positive stepsize values and 
(Xj)j>i is a sequence of relaxation parameters in ]0,1], Here- 
above, prox^,(i>) denotes the proximity operator at v £ R^ 
of a lower-semicontinuous convex function xj: —»] — 

oo, + 00 ] with nonempty domain, i.e., the unique minimizer of 
xj) + ||| ■ — v \\ 2 (see 18ill and the references therein), and 
g o D{ x) = g{Dx). A convergence analysis of the SFB 
scheme has been conducted in |82 h 851, under various assump¬ 
tions on the functions 'I 1 , g, on the stepsize sequence, and 
on the statistical properties of (uj)j> 1 . For example, if xj 
is set to a given (deterministic) value, the sequence (xj)j> 1 
generated by (IH is guaranteed to converge almost surely 
to a solution of Problem (l53l > under the following technical 
assumptions 84] 

(i) $ has a /3 _1 -Lipschitzian gradient with /3 e]0, +oo[, g is 
a lower-semicontinuous convex function, and <I> + g o D 
is strongly convex. 


(ii) For every j > 1, 


E{|K-|| 2 } < + 00 , E{ Uj | = V$( Xj ), 

E{||wj — V$(a:j)|| 2 | X i _ 1 }<a 2 (l + a j ||V$( :Ci )|| 2 ) 

where Xj = (y.j, /r,;)i< 7 ;<j, and atj and a are positive 
values such that 7 j < (2 —e)/3(lT2cr 2 ay) _1 with e > 0. 

(iii) We have 


= +°° 

3> 1 


and 


5Z A.? < +00 
j> 1 


where, for every j > 1 , Xj = Aj 7 2 (l + 2aj ||V<b(:r) 


and x is the solution of Problem (1531) . 

When g = 0, the SFB algorithm in ( 155b becomes equivalent 


to SGD I86H891. According to the above result, the conver¬ 


gence of SGD is ensured as soon as X]j>i = Too an d 
Xj "if < Too. In the unrelaxed case defined by A j = 1, 
we then retrieve a particular case of the decaying condition 
7 j oc j -1 / 2-5 with 5 G]0,1/2] usually imposed on the stepsize 
in the convergence studies of SGD under slightly different 
assumptions on the gradient estimates (uj)j> 1 (see SEl 
for more details). Note also that better convergence properties 
can be obtained, if a Polyak-Ruppert averaging approach is 
performed, i.e., the averaged sequence (xj)j >v defined as 
Xj = A "Yjl -1 x, for every j > 1 , is considered instead of 
( x j)j> 1 i n the convergence analysis |9jl 921. 

We now comment on approaches related to SFB that have 
been proposed in the literature to solve (f53l) . It should first be 
noted that a simple alternative strategy to deal with a possibly 
nonsmooth term g is to incorporate a subgradient step into 
the previously mentioned SGD algorithm [93] - However, this 
approach, like its deterministic version, may suffer from a slow 
convergence rate |94]. Another family of methods, close to 
SFB, adopt the regularized dual averaging (RDA) strategy, 
first introduced in |94||. The principal difference between 
SFB and RDA methods is that the latter rely on iterative 
averaging of the stochastic gradient estimates, which consists 
of replacing in the update rule (l55l) . (uj)^ >:L by (uj)j >1 
where, for every j > 1 , Uj = j 1 u > - The advantage 
is that it provides convergence guarantees for nondecaying 
stepsize sequences. Finally, the so-called composite mirror de¬ 
scent methods, introduced in [! 95 ]. can be viewed as extended 
versions of the SFB algorithm where the proximity operator is 
computed with respect to a non Euclidean distance (typically, 
a Bregman divergence). 

In the last few years, a great deal of effort has been 
made to modify SFB when the proximity operator of g o D 
does not have a simple expression, but when g can be split 
into several terms whose proximity operators are explicit. We 
can mention the stochastic proximal averaging strategy from 
|96|], the stochastic alternating direction method of mutipliers 
(ADMM) from [ 97 - 9^] and the alternating block strategy from 
[100] suited to the case when g o D is a separable function. 

Another active research area addresses the search for strate¬ 
gies to improve the convergence rate of SFB. Two main 
approaches can be distin guis hed in the literature. The first, 
adopted for example in [83;, 101- |l03j| , relies on subspace 
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acceleration. In such methods, usually reminiscent of Nes¬ 
terov’s acceleration techniques in the deterministic case, the 
convergence rate is improved by using information from 
previous iterates for the construction of the new estimate. 
Another efficient way to accelerate the convergence of SFB 
is to incorporate in the update rule second-order information 
one may have on the cost functions. For instance, the method 
described in 110411 incorporates quasi-Newton metrics into the 
SFB and RDA algorithms, and the natural gradient method 
from 110511 can be viewed as a preconditioned SGD algorithm. 
The two strategies can be combined, as for example, in & 
2) Adaptive filtering methods: In adaptive filtering, stochas¬ 
tic gradient-like methods have been quite popular for a long 
time E 3 Il08h . In this field, the functions often 


reduce to a least squares criterion 

(Vi > 1 ) <Pj{hjx, Vj ) = (hjx — Vj ) 


denotes the (possibly weighted) t\ norm and ( r \, p) e] 0 , +oo[ 2 . 
Over-relaxed projection algorithms allow such kind of prob¬ 
lems to be solved efficiently 1 124 125 1. 


(56) 


where x is the unknown impulse response. However, a specific 
difficulty to be addressed is that the designed algorithms must 
be able to deal with dynamical problems the optimal solution 
of which may be time-varying due to some changes in the 
statistics of the available data. In this context, it may be 
useful to adopt a multivariate formulation by imposing, at each 
iteration j > Q 

Vj — HjX (57) 

where y 3 = [ Vj ,..., yj _ Q+1 ] T , H, = [hj ,..., hj- Q+1 ] T , 

and Q > 1. This technique, reminiscent of mini-batch proce¬ 
dures in machine learning, constitutes the principle of affine 
projection algorithms , the purpose of which is to accelerate the 
convergence speed [109]. Our focus now switches to recent 
work which aims to impose some sparse structure on the 
desired solution. 

A simple method for imposing sparsity is to introduce a 
suitable adaptive preconditioning strategy in the stochastic 
gradient iteration, leading to the so-called proportionate least 
mean square method [ 110 , 111 ], which can be combined 
with affine projection techniques 1 112 , El. Similarly to 


C. Stochastic algorithms for solving deterministic optimiza¬ 
tion problems 

We now consider the deterministic optimization problem 
defined by (HD and ( l52l >. Of particular interest is the case 
when the dimensions N and/or M are very large (for instance, 
in CH, M = 2500000 and in [El, N = 100250). 

1) Incremental gradient algorithms: Let us start with in¬ 
cremental methods, which are dedicated to the solution of 
(ED when M is large, so that one prefers to exploit at 
each iteration a single term tpj, usually through its gradient, 
rather than the global function ip. There are many variants of 
incremental algorithms, which differ in the assumptions made 
on the functions involved, on the stepsize sequence, and on 
the way of activating the functions (jPi)i<i<M- This order 
could follow either a deterministic [128 ] or a randomized rule. 
However, it should be noted that the use of randomization 
in the selection of the components presents some benefits 
in terms of convergence rates Ei which are of particular 
interest in the context of machine learning 1 130i 131 1. where 
the user can only afford few full passes over the data. Among 
randomized incremental methods, the SAGA algorithm El, 
presented below, allows the problem defined in (ED to be 
solved when the function g is not necessarily smooth, by 
making use of the proximity operator introduced previously. 
The n-th iteration of SAGA reads as 


,2/in) 


the work already mentioned that has been developed in the 
machine learning community, a second approach proceeds 
by minimizing penalized criteria such as d53l > where g is 
a sparsity measure and D = /jy. In ill 14l ll 15tl . zero- 
attracting algorithms are developed which are based on the 
stochastic subgradient method. These algorithms have been 
further extended to affine projection techniques in 
_8£1. Proximal methods have also been proposed in the 
context of adaptive filtering, gro unded on the use of the 
forward-backward algorithm [ 119f], an accelerated version of 


— h'jri. ^ ( h' ln 5 Vjn ) ^ T.'/n (,^j n Z 

z i,m Vi) 

^jn ,Tl+l — 

Zi,n +1 = Zi,n Vi G {1, • • ■ , M } \ { jn\ 

x n+ i = prox D (x n - 7 u n ) 

(58) 

where 7 e] 0 ,+oo[, for all *g{ 1 ,...,M}, Zip = Xi e 
M iV , and j n is drawn from an i.i.d. uniform distribution on 
{1 ,,M}. Note that, although the storage of the variables 
(zi,n) 1<i<M can be avoided in this method, it is necessary to 

r . The 


store the M gradient vectors (hf\/<Pi(hj Zi iU ,yi )) 


it il2dl . or primal-dual approaches [121]. It is interesting to 
note that proportionate affine projection algorithms can be 
viewed as special cases of these methods [119([ . Other types 
of algorithms have been proposed which provide extensions 
of the recursive least squares method, which is known for 
its fast convergence properties 1 106 . 122 . El- Instead of 
minimizing a sparsity promoting criterion, it is also possible 
to formulate the problem as a feasibility problem where, at 
iteration j > Q, one searches for a vector x satisfying both 
su Pj<i<j-Q+i 12 H - h?x\ < 7 and ||as||i < p, where || • ||i 


convergence of Algorithm (1581) has been analyzed in 1132]. If 
the functions are /3 -1 -Lipschitz differentiable and 

/i-strongly convex with (/ 3 ,p) G]0, +oo[ 2 and the stepsize 7 
equals /3/(2(p/3M + 1)), then (E{||ir n - ®|| 2 }) n( =N goes to 
zero geometrically with rate 1 — 7, where x is the solution 
to Problem (ED . When only convexity is assumed, a weaker 
convergence result is available. 

The relationship between Algorithm (ED and other stochas¬ 
tic incremental methods existing in the literature is worthy of 
comment. The main distinction arises in the way of building 
the gradient estimates (u n )n>i ■ The standard incremental 
gradient algorithm, analyzed for instance in jl 29il . relies on 
simply defining, at iteration n, u n = hj rl J \7(pj n (hJ n x n ,yj n ). 
However, this approach, while leading to a smaller com¬ 
putational complexity per iteration and to a lower mem¬ 
ory requirement, gives rise to suboptimal convergence rates 
1911 [129], mainly due to the fact that its convergence requires 
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a stepsize sequence ( r y n ) n >i decaying to zero. Motivated by 
this observation, much recent work 1 126 . 13C)i-134 1 has been 
dedicated to the development of fast incremental gradient 
methods, which would benefit from the same convergence 
rates as batch optimization methods, while using a randomized 
incremental approach. A first class_of methods relies on a 
variance reduction approach G3 Il32l4l34ll which aims at 
diminishing the variance in successive estimates (u n )n>i- 
All of the aforementioned algorithms are based on iterations 
which are similar to (l58l >. In the stochastic variance reduction 
gradient method and the semi-stochastic gradient descent 
method proposed in 1331 [134], a full gradient step is made 
at every K iterations, K > 1, so that a single vector z n is 
used instead of (. Zi, n ) 1<i<M in the update rule. This so-called 
mini-batch strategy leads to a reduced memory requirement at 
the expense of more gradient evaluations. As pointed out in 
1 132], the choice between one strategy or another may depend 
on the problem size and on the computer architecture. In the 
stochastic average gradient algorithm (SAGA) from 03, a 
multiplicative factor 1/M is placed in front of the gradient 
differences, leading to a lower variance counterbalanced by a 
bias in the gradi ent estimates. It should be emphasized that 
the work in 1 130i. 133 1 is limited to the case when g = 0. A 
second class of methods, closely related to SAGA, consists 
of applying the proximal step to z n — 7 u n , where ~z n is 
the average of the variables (zi 1 „) 1 < i<M (which thus need 
to be stored). Thi s ap proach is retained for instance in the 
Finito algorithm II13 111 as well as in some instances of the 
minimization by incremental surrogate optimization (MISO) 
algorithm, proposed in G3. These methods are of particular 
interest when the extra storage cost is negligible with respect 
to the high computational cost of the gradients. Note that 
the MISO algorithm relying on the majoration-minimization 
framework employs a more generic update rule than Finito and 
has proven convergence guarantees even when ij is nonzero. 

2 ) Block coordinate approaches: In the spirit of the Gauss- 
Seidel method, an efficient approach for dealing with Prob¬ 
lem (ED when TV is large consists of resorting to block 
coordinate alternating strategies. Sometimes, such a block 


alternation can be performed in a deterministic manner 11135 


1136h . However, many optimization methods are based on fixed 
point algorithms, and it can be shown that with deterministic 
block coordinate strategies, the contraction properties which 
are required to guarantee the convergence of such algorithms 
are generally no longer satisfied. In turn, by resorting to 
stochastic techniques, these properties can be retrieved in some 
probabilistic sense [85]. In addition, using stochastic rules for 
activating the different blocks of variables often turns out to 
be more flexible. 

To illustrate why there is interest in block coordinate 
approaches, let us split the target variable x as [x p ,..., *^] T , 
where, for every k £ {1,..., K}, x k £ R Nk is the fc-th block 
of variables with reduced dimension TVfc (with TVi +■ • -+N = 
TV). Let us further assume that the regularization function can 
be blockwise decomposed as 


K 


g(Dx) = ^2 9iA x k) + g 2 ,k{D k x k ) 


(59) 


fc=1 


where, for every k £ {1, ..., K}, D k is a matrix in S. PkXNk , 
and gi tk : R Nk —)■] — 00, +00] and g 2 , k '■ —>] — 00, +00] 

are proper lower-semicontinuous convex functions. Then, the 
stochastic primal-dual proximal algorithm allowing us to solve 
Problem (151! is given by 


Algorithm 5 Stochastic primal-dual proximal algorithm 
for n = 1,2,... do 
for k = 1 to K do 

with probability e k £ (0,1] do 

v k ,n +1 = (Id-prox T -i S2 A>{Vk,n + D k X k , n ) 

ttt k ,n +1 = prOX,^^ fc (^ x k,n (/2v k ^_i n V k n ^ 

d" M Si=l h'i,kVipi(Y2k'=l h‘i,k ,x k l ,ni Vi 

otherwise 

= V k n , X kn -\-i = X kn . 

end for 
end for 


In the algorithm above, for every i £ {1,... ,M}, the scalar 
product hj x has been rewritten in a blockwise manner as 
2j fc ,_i h t k fX k >. Under some stability conditions on the choice 
of the positive step sizes r and 7, x n = [x p n ,..., n ] T 
converges almost surely to a so lution of the minimization 
problem, as n —> +00 (see 113711 for more technical details). 
It is important to note that the convergence result was estab¬ 
lished for arbitrary probabilities e = [sq,..., £k] T . provided 
that the block activation probabilities e k are positive and 
independent of n. Note that the various blocks can also be 
activated in a dependent manner at a giv en iteration n. Like 
its deterministic counterparts (see 113811 and the references 
therein), this algorithm enjoys the property of not requiring 
any matrix inversion, which is of paramount importance when 
the matrices (D k )i< k <K are of large size and do not have 
some simple forms. 

When g 2 , k = 0, the random block coordinate forward- 
backward algorithm is recovered as an instance of Algorithmic 
since the dual variables (itfe,n)i<fc<it:,nGN can be set to 0 
and the constant r becomes useless. An extensive literature 
exists on the latter algorithm and its variants. In particular, its 
almost sure convergence was established in 1851 under general 
conditions, whereas_some worst case convergence rates were 
derived in 1 139l - 143 1. In addition, if g\ k = 0, the random 
block coordinate descent algorithm is obtained 144]. 

When the objective function minimized in Problem ( l5l! 
is strongly convex, the random block coordinate forward- 
backward algorithm can be applied to the dual problem, in 
a similar fashion to the dual forward-backward method used 
in the deterministic case |145tl . This leads to so-called dual 
ascent strategies which have become quite popular in machine 
learning 1 146t - 149ll . 

Random block coordinate versions of other proximal algo¬ 
rithms such as the Douglas-Rachford algorithm and ADMM 
have also been proposed |85 Finally, it is worth em¬ 

phasizing that asynchronous distributed algorithms can be 
deduced from various randomly activated block coordinate 
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methods 1371115 111 . As well as dual ascent methods, the latter 
algorithms can also be viewed as incremental methods. 


V. Areas of intersection: 
optimization-within-MCMC and MCMC-driven 

OPTIMIZATION 

There are many important examples of the synergy between 
stochastic simulation and optimization, including global op¬ 
timization by simulated annealing, stochastic EM algorithms, 
and adaptive MCMC samplers |4|]. In this section we highlight 
some of the interesting new connections between modern 
simulation and optimization that we believe are particularly 
relevant for the SP community, and that we hope will stimulate 
further research in this community. 


A. Riemannian manifold MALA and HMC 

Riemannian manifold MALA and HMC exploit differential 
geometry for the problem of specifying an appropriate pro¬ 
posal covariance matrix £ that takes into account the geometry 
of the target density n [20], These new methods stem from the 
observation that specifying £ is equivalent to formulating the 
Langevin or Hamiltonian dynamics in an Euclidean parameter 
space with inner product (w,'£~ 1 x). Riemannian methods 
advance this observation by considering a smoothly-varying 
position dependent matrix £(a:), which arises naturally by for¬ 
mulating the dynamics in a Riemannian manifold. The choice 
of S(tc) then becomes the more familiar problem of specifying 
a metric or distance for the parameter space [201]. Notice 
that the Riemannian and the canonical Euclidean gradients 
are related by Vg( x) = '£(x)\7ij(x). Therefore this problem 
is also closely related to gradient preconditioning in gradient 
descent optimization discussed in Sec IV.B. Standard choices 
for £ include for example the inverse Hessian matrix 00. 
which is closely related to Newton’s optimization method, 
and the inverse Fisher information matrix [20], which is the 
“natural” metric from an information geometry viewpoint and 
is_also related to optimization by natural gradient descent 
[105]. These strategies have originated in the computational 
statistics community, and perform well in inference problems 
that are not too high-dimensional. Therefore, the challenge is 
to design new metrics that are appropriate for SP statistical 
models (see 111521115311 for recent work in this direction). 


B. Proximal MCMC algorithms 

Most high-dimensional MCMC algorithms rely particularly 
strongly on differential analysis to navigate vast parameter 
spaces efficieoptimizationntly. Conversely, the potential of 
convex calculus for MCMC simulation remains largely unex¬ 
plored. This is in sharp contrast with modern high-dimensional 
optimization described in Section IIVI where convex calculus 
in general, and proximity operators |8J. , EH in particular, 
are used extensively. This raises the question as to whether 
convex calculus and proximity operators can also be useful for 
stochastic simulation, especially for high-dimensional target 
densities that are log-concave, and possibly not continuously 
differentiable. 


This question was studied recently in [24] in the context of 
Langevin algorithms. As explained in Section II.B, Langevin 
MCMC algorithms are derived from discrete-time approxi¬ 
mations of the time-continuous Langevin diffusion process 
©. Of course, the stability and accuracy of the discrete 
approximations determine the theoretical and practical conver¬ 
gence properties of the MCMC algorithms they underpin. The 
approximations commonly used in the literature are generally 
well-behaved and lead to powerful MCMC methods. However, 
they can perform poorly if tv is not sufficiently regular, for 
example if tv is not continuously differentiable, if it is heavy¬ 
tailed, or if it has lighter tails than a Gaussian distribution. 
This drawback limits the application of MCMC approaches 
to many SP problems, which rely increasingly on models that 
are not continuously differentiable or that involve constraints. 

Using proximity operators, the following proximal approx¬ 
imation for the Langevin diffusion process © was recently 
proposed in 0 


x( t+1 ) 




prox-a 


log 7T 


(60) 


as an alternative to the standard forward Euler approximation 
A4 t+1 ) ~ + fVlogTr(XW) , <5I„) used in MALA0. 

Similarly to MALA, the time step S can be adjusted online 
to achieve an acceptance probability of approximately 50%. 
It was established in [24] that when 7r is log-concave, 
defines a remarkably stable discretization of © with optimal 
theoretical convergence properties. Moreover, the “proximal” 
MALA resulting from combining (f60b with an MH step has 
very good geometric ergodicity properties. In 0, the algo¬ 
rithm efficiency was demonstrated empirically on challenging 
models that are not well addressed by other MALA or HMC 
methodologies, including an image resolution enhancement 
model with a total-variation prior. Further practical assess¬ 
ments of proximal MALA algorithms would therefore be a 
welcome area of research. 


Proximity operators have also been used recently in [155!] 
for HMC sampling from log-concave densities that are not 
continuously differentiable. The experiments reported in 1551] 
show that this approach can be very efficient, in particular 
for SP models involving high-dimensionality and non-smooth 
priors. Unfortunately, theoretically analyzing HMC methods 
is difficult, and the precise theoretical convergence properties 
of this algorithm are not yet fully understood. We hope future 
work will focus on this topic. 


C. optimization-driven Gaussian simulation 

The standard approach for simulating from a multivariate 
Gaussian distribution with precision matrix Q £ M" x n is 
to perform a Cholesky factorization Q = L T L, generate an 
auxiliary Gaussian vector w ~ A/”(0, In), and then obtain the 
desired sample x by solving the linear system Lx = w S 
The computational complexity of this approach generally 
scales at a prohibitive rate 0(N 3 ) with the model dimension 
N, making it impractical for large problems, (note however 

4 Recall that prox (u) denotes the proximity operator of <p evaluated at 

v£ r n isnira. 
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that there are specific cases with lower complexity, for instance 
when Q is Toeplitz 1 157 1. circulant EH or sparse Q). 

optimization-driven Gaussian simulators arise from the ob¬ 
servation that the samples can also be obtained by minimizing 
a carefully designed stochastic cost function I159i.ll60fl . For il¬ 
lustration, consider a Bayesian model with Gaussian likelihood 
y\x ~ Af(Hx, S„) and Gaussian prior x ~ Af(xo,'S x ), 
for some linear observation operator H £ R ArxM , prior 
mean Xq £ R‘ V , and positive definite covariance matrices 
and Yi y £ R MxM . The posterior distribution 


IxN 


£ 

p(x\y) is Gaussian with mean y £ 
Q £ M. NxN given by 


and precision matrix 


Q = H T 'Z y 1 H + V~ l 

y = Q 1 (H T '£ y l y + ■ 

Simulating samples x\y ~ A f(y, Q 1 ) by Cholesky factoriza¬ 
tion of Q can be computationally expensive when N is large. 
Instead, optimization-driven simulators generate samples by 
solving the following “random” minimization problem 

x = argmin (w i — Hu) T {w\ — Hu) 

uG r n (61) 

+ [w 2 - u) T S' 1 (w 2 - u) 


with random vectors Wi ~ A f(y, S y ) and w 2 ~ AF(xq, S x ). 
It is easy to check that if ED is solved exactly, then x 
is a sample from the desired posterior distribution p(x\y). 
From a computational viewpoint, however, it is significantly 
more efficient to solve (ED approximately, for ex amp le by 
using a few linear conjugate gradient iterations 1 16011 . The 
approximation error can then be corrected by using an MH step 
[161], at the expense of introducing some correlation between 
the samples and therefore reducing the total effective sample 
size. Fortunately, there is an elegant strategy to determine auto¬ 
matically the optimal number of conjugate gradient iterations 
that maximizes the overall efficiency of the algorithm [161]]. 


VI. Conclusions and Observations 

In writing this paper we have sought to provide an in¬ 
troduction to stochastic simulation and optimization methods 
in a tutorial format, but which also raised some interesting 
topics for future research. We have addressed a variety of 
MCMC methods and discussed surrogate methods, such as 
variational Bayes, the Bethe approach, belief and expectation 
propagation, and approximate message passing. We also dis¬ 
cussed a range of recent advances in optimization methods 
that have been proposed to solve stochastic problems, as well 
as stochastic methods for deterministic optimization. Subse¬ 
quently, we highlighted new methods that combine simulation 
and optimization, such as proximal MCMC algorithms and 
optimization-driven Gaussian simulators. Our expectation is 
that future methodologies will become more flexible. Our com¬ 
munity has successfully applied computational inference meth¬ 
ods, as we have described, to a plethora of challenges across 
an enormous range of application domains. Each problem 
offers different challenges, ranging from model dimensionality 
and complexity, data (too much or too little), inferences. 


accuracy and computation times. Consequently, it seems not 
unreasonable to speculate that the different computational 
methodologies discussed in this paper will evolve to become 
more adaptable, with their boundaries becoming less well 
defined, and with the development of algorithms that make 
use of simulation, variational approximations and optimization 
simultaneously. Such an approach is more likely to be able to 
handle an even wider range of models, datasets, inferences, 
accuracies and computing times in a computationally efficient 
way. 
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